ABSTRACT


Robust statistical methods have recently been shown to have
desirable properties when used for the identification of
similarities and differences in shape. We present a generalization
of the two-dimensional repeated median algorithm to three and higher
dimensions. The extension is achieved using a duality between
orthogonal and skew-symmetric matrices, which permits the definition
of a median of a collection of orthogonal matrices. The methods are
illustrated by comparing the predicted three-dimensional
configuration of a protein molecule to a refined structure that had
been found using nuclear magnetic resonance techniques.

1. INTRODUCTION


Many quantitative methods for the comparison of shape and form
in two dimensions have been proposed since the fundamental
descriptive work of Thompson (1917). Sneath (1967) used least
squares as a basis for establishing a common frame of reference for
the comparison of two objects and for drawing inferences about their
similarities and differences. Robust estimation for this problem
using the technique of repeated medians was proposed by Siegel and
Benson (1982), and some real advantages of robust methods over least
squares were demonstrated. Related theory and examples may be found
in Siegel (1982a and 1982b) and in Olshan, Siegel, and Swindler
(1982). Some additional contributions to the study of shape and
form include Gould (1966), Mosimann (1970), Gower (1975), and
Bookstein (1977).

Robust methods are often superior to least squares in the
comparison of shape because a localized difference in shape between
two objects can be thought of as an outlier in the fitting process.
Due to its high sensitivity to outliers, a least squares fit will
tend to underplay the size of such a shape difference, and thereby
render it difficult to detect. At the same time, differences may
tend to be exaggerated at points that would otherwise have been

For example, Figure 1.1 shows the comparison of two
hypothetical three-dimensional geometric shapes by rotation and
translation. The fitting process acts on the nine homologous pairs
of points, one point in each shape, and tries to bring point i of
shape 1 close to point i of shape 2. The shapes are identical
in 8 of their 9 points, which are placed at the vertices of a
cube, while the last point is different. As a result of trying to
bring the outlying points closer together, the least squares fit
suggests the existence of shape differences at all 9 points, while
the robust fit (computed using the methods to be developed here)
correctly indicates the closeness of the correspondence at the
vertices of the cube, and also indicates the full size of the
difference at the last point.

The main difficulty involved in extending the repeated median
technique for shape comparison from two to three dimensions is that
the componentwise median of a set of orthogonal matrices need not
itself be an orthogonal matrix. By working with angles instead of
matrices this problem can be avoided in two dimensions. In Section 2
we show how the three-dimensional rotational component of the fit
can be obtained by medians using a duality between orthogonal and
skew symmetric matrices. These methods are illustrated in Section 3
using data from the three-dimensional configurations of related
protein molecules that have been studied by Dower (1979) using least
squares techniques.

We will assume that our data set consists of n homologous
points in k dimensions, denoted $X_1, \ldots, X_n$ for shape 1 and
$U_1, \ldots, U_n$ for shape 2. In order to transform the points of
shape 2 to a close fit with the corresponding points of shape 1,
we will allow rotations and translations, estimating an orthogonal
rotation matrix O and a translation vector T so that the
residual vectors

(1.1)  $U_i - (T + O X_i)$

are small in magnitude. A magnification factor m can be included,
in which case we would make the residual vectors

(1.2)  $U_i - m\,(T + O X_i)$

small in magnitude by estimating O, T, and m.

The least squares solution, which minimizes the sum of the
squared lengths of the terms in (1.1), can be computed using the
singular value decomposition (Huber, 1980).
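
The least squares fit of (1.1) can be sketched as follows. This is a minimal illustration, not the authors' program: it assumes NumPy, the function name `least_squares_fit` is hypothetical, and it uses the standard SVD construction of the best rotation applied to the centered point sets, with a sign correction so that the result is a proper rotation rather than a reflection.

```python
import numpy as np

def least_squares_fit(X, U):
    """Rotation O and translation T minimizing sum_i ||U_i - (T + O X_i)||^2.

    X, U: (n, k) arrays whose rows are homologous points.
    """
    X = np.asarray(X, dtype=float)
    U = np.asarray(U, dtype=float)
    x_bar, u_bar = X.mean(axis=0), U.mean(axis=0)
    # k-by-k cross-product matrix of the centered configurations.
    H = (X - x_bar).T @ (U - u_bar)
    W, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that det(O) = +1 (no reflection).
    d = 1.0 if np.linalg.det(Vt.T @ W.T) >= 0 else -1.0
    D = np.eye(X.shape[1])
    D[-1, -1] = d
    O = Vt.T @ D @ W.T
    # With O fixed, the best translation matches the two centroids.
    T = u_bar - O @ x_bar
    return O, T
```

When the two shapes differ only by a rotation and a translation, this sketch recovers both exactly (up to rounding); with a localized distortion it spreads the discrepancy over all points, which is the behavior the robust method is meant to avoid.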


2. THE THREE-DIMENSIONAL REPEATED MEDIAN ALGORITHM


The repeated median algorithm, like a U-statistic (Hoeffding,
1948), proceeds one parameter at a time. We first present details
for estimation of the orthogonal matrix, then summarize the steps
for obtaining the translation and magnification. A preliminary
least squares fit is used as a point of departure.

A subset of two pairs of homologous points, say points with
indices i and j, from each of the two shapes (i.e. points $X_i$ and
$X_j$ of shape 1 with points $U_i$ and $U_j$ of shape 2)
is not sufficient to uniquely determine a three-dimensional
rotation. Three pairs of homologous points, for example i, j, and
k, generally are sufficient to determine such a rotation, although
different methods will result in slightly different rotation
matrices. One method that generalizes easily to higher dimensions
is based on the least squares fit of the three points. However,
this will not usually match anything exactly. In order to match
some aspects of the data exactly, we will choose a three by three
orthogonal matrix $O_{ijk}$ (one matrix for each ordered triple i,
j, and k) so that


(2.1)  the directions of $O_{ijk}(X_j - X_i)$ and $U_j - U_i$ are
       the same

and

(2.2)  the transformed point $O_{ijk} X_k$ is in the same plane as
       the points $U_i$, $U_j$, and $U_k$, and is on the same side
       of the line through $U_i$ and $U_j$ as is $U_k$.


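To find $O_{ijk}$ we will define vectors

(2.3)  $X_{ij} = \frac{X_j - X_i}{\|X_j - X_i\|}, \qquad U_{ij} = \frac{U_j - U_i}{\|U_j - U_i\|}$

(2.4)  $X_{ijk} = \frac{(X_k - X_i) - [(X_k - X_i) \cdot X_{ij}]\, X_{ij}}{\|(X_k - X_i) - [(X_k - X_i) \cdot X_{ij}]\, X_{ij}\|}$

(2.5)  $U_{ijk} = \frac{(U_k - U_i) - [(U_k - U_i) \cdot U_{ij}]\, U_{ij}}{\|(U_k - U_i) - [(U_k - U_i) \cdot U_{ij}]\, U_{ij}\|}$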

It can be verified that the rotation matrix $O_{ijk}$ satisfying
conditions (2.1) and (2.2) for points i, j, and k is

(2.6)  $O_{ijk} = (U_{ij},\; U_{ijk},\; U_{ij} \times U_{ijk})\,(X_{ij},\; X_{ijk},\; X_{ij} \times X_{ijk})^{T}$

where "×" denotes the cross product of two vectors, and each triple
of vectors in parentheses forms the columns of a three by three
matrix.
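
The construction of the rotation for one ordered triple in (2.3) through (2.6) can be sketched as below. This is an illustrative sketch assuming NumPy; the names `unit` and `triple_rotation` are hypothetical, and the three points of each shape are assumed not to be collinear (otherwise a division by zero occurs).

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def triple_rotation(Xi, Xj, Xk, Ui, Uj, Uk):
    """Orthogonal matrix for the ordered triple (i, j, k), following (2.6)."""
    Xi, Xj, Xk, Ui, Uj, Uk = (np.asarray(p, dtype=float)
                              for p in (Xi, Xj, Xk, Ui, Uj, Uk))
    # Unit vectors along the i-to-j direction in each shape.
    Xij = unit(Xj - Xi)
    Uij = unit(Uj - Ui)
    # Part of the i-to-k vector orthogonal to the i-to-j direction.
    dx, du = Xk - Xi, Uk - Ui
    Xijk = unit(dx - (dx @ Xij) * Xij)
    Uijk = unit(du - (du @ Uij) * Uij)
    # Orthonormal frames as matrix columns; O carries the X frame to the U frame.
    A = np.column_stack([Uij, Uijk, np.cross(Uij, Uijk)])
    B = np.column_stack([Xij, Xijk, np.cross(Xij, Xijk)])
    return A @ B.T
```

Because both frames are orthonormal and right-handed, the product is orthogonal with determinant +1, and it depends only on differences of points, so it is unaffected by any translation between the shapes.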


The repeated median process computes a single matrix from these
n(n-1)(n-2) orthogonal matrices using a duality between
orthogonal and skew symmetric matrices, details of which may be
found in Bwea (1966). The skew symmetric matrix corresponding to
$O_{ijk}$ is

(2.7)  $S_{ijk} = (O_{ijk} + I)^{-1}\,(O_{ijk} - I)$


where I denotes the identity matrix. Taking triply repeated
medians of each entry, we obtain the skew symmetric matrix S:

(2.8)  $S = \operatorname*{median}_{i} \left\{ \operatorname*{median}_{j \ne i} \left[ \operatorname*{median}_{k \ne i,j} S_{ijk} \right] \right\}$

where the median of a set of matrices is defined as the matrix of
medians computed at each entry. The skew symmetric matrix S is
then transformed back to the orthogonal matrix O using the inverse
relation

(2.9)  $O = (I + S)\,(I - S)^{-1}$

which completes the definition of the repeated median orthogonal
rotation matrix O.
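
Steps (2.6) through (2.9) can be put together in a short sketch. It assumes NumPy; the function names are hypothetical, the helper repeats the triple construction so that the block is self-contained, and the plain triple loop costs on the order of n^3 small matrix solves. The sketch also assumes that no three points are collinear and that no $O_{ijk}$ is a half-turn, since $O_{ijk} + I$ is singular in that case; starting from a preliminary least squares fit keeps the rotations near the identity.

```python
import numpy as np

def triple_rotation(X, U, i, j, k):
    """O_ijk of (2.6), built from orthonormal frames on points i, j, k."""
    def frame(P):
        a = P[j] - P[i]
        a = a / np.linalg.norm(a)
        b = P[k] - P[i]
        b = b - (b @ a) * a              # remove the component along a
        b = b / np.linalg.norm(b)
        return np.column_stack([a, b, np.cross(a, b)])
    return frame(U) @ frame(X).T

def repeated_median_rotation(X, U):
    """Repeated median rotation: (2.7), then (2.8), then (2.9)."""
    X = np.asarray(X, dtype=float)
    U = np.asarray(U, dtype=float)
    n = len(X)
    I = np.eye(3)
    over_i = []
    for i in range(n):
        over_j = []
        for j in range(n):
            if j == i:
                continue
            over_k = []
            for k in range(n):
                if k == i or k == j:
                    continue
                O = triple_rotation(X, U, i, j, k)
                # (2.7): S = (O + I)^{-1} (O - I), a skew symmetric matrix.
                over_k.append(np.linalg.solve(O + I, O - I))
            over_j.append(np.median(over_k, axis=0))   # entrywise median over k
        over_i.append(np.median(over_j, axis=0))       # entrywise median over j
    S = np.median(over_i, axis=0)                      # entrywise median over i
    # (2.9): map the median skew symmetric matrix back to an orthogonal one.
    return (I + S) @ np.linalg.inv(I - S)
```

On a configuration like the cube example of the Introduction (eight points related by an exact rotation, one point displaced), every triple that avoids the displaced point yields the same matrix, so the nested medians return that rotation unchanged.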

The translation vector, T, should be computed by finding a
robust estimate of the three dimensional location of the data
$U_i - O X_i$ (i = 1, ..., n), using the value for O from (2.9).
This location might be found using the mediancentre (Bedall and
Zimmermann, 1979), which is the point that minimizes the sum of the
Euclidean distances from it. A simpler method is to use the vector
of univariate medians computed separately for each coordinate.
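
The simpler of the two location estimates, the vector of coordinatewise medians, might be sketched as follows. NumPy is assumed and the name `median_translation` is hypothetical; the mediancentre alternative is not shown.

```python
import numpy as np

def median_translation(X, U, O):
    """Coordinatewise median of the vectors U_i - O X_i."""
    X = np.asarray(X, dtype=float)
    U = np.asarray(U, dtype=float)
    O = np.asarray(O, dtype=float)
    # Row i of X @ O.T is O applied to X_i.
    D = U - X @ O.T
    return np.median(D, axis=0)
```

Unlike the mediancentre, the coordinatewise median depends on the choice of coordinate axes, which is the price paid for its simplicity.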

The magnification factor m, which is needed in some
applications but omitted in others, can be found using the same
technique used by Siegel and Benson (1982). Because m can be
estimated as a U-statistic based on pairs of points regardless of
the dimensionality of the data, this procedure is no more
complicated in higher dimensions than it is in two dimensions. The
doubly repeated median of the ratios of the lengths of homologous
line segments is

(2.10)  $m = \operatorname*{median}_{i} \left( \operatorname*{median}_{j \ne i} \frac{\|U_j - U_i\|}{\|X_j - X_i\|} \right)$
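
The doubly repeated median of segment-length ratios can be sketched as below. NumPy is assumed and the name `repeated_median_scale` is hypothetical. The ratio is taken as shape 2 length over shape 1 length, which is the orientation implied by the residual form (1.2); distinct points within each shape are assumed so that no length is zero.

```python
import numpy as np

def repeated_median_scale(X, U):
    """Doubly repeated median of ||U_j - U_i|| / ||X_j - X_i||."""
    X = np.asarray(X, dtype=float)
    U = np.asarray(U, dtype=float)
    n = len(X)
    inner = []
    for i in range(n):
        # Ratios of homologous segment lengths from point i to every other point.
        ratios = [np.linalg.norm(U[j] - U[i]) / np.linalg.norm(X[j] - X[i])
                  for j in range(n) if j != i]
        inner.append(np.median(ratios))
    return float(np.median(inner))
```

Only distances enter the computation, so the estimate does not depend on the rotation or translation and works unchanged in any number of dimensions.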

The breakdown and resistance properties of repeated median
procedures as outlined in Siegel and Benson (1982) and in Siegel
(1982a) still hold with these procedures: the breakdown value is
approximately 50%. In particular, if more than (n+2)/2 of the
points can be fitted closely, then this repeated median procedure
will do so. If a single overall median is used instead, then the
breakdown value is approximately 21% (this is $1 - .5^{1/3}$),
indicating that the overall median technique may not indicate
clearly a localized distortion involving more than one fifth of the
points.
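
One way to see where the 21% figure may come from is the following heuristic, which is an assumption of this sketch rather than an argument given in the text: if a fraction $\varepsilon$ of the points is distorted, a triple (i, j, k) yields an uncontaminated matrix only when all three of its points are undistorted, and a single overall median stays reliable only while such triples form a majority.

```latex
\[
(1-\varepsilon)^{3} > \tfrac{1}{2}
\quad\Longleftrightarrow\quad
\varepsilon < 1 - 0.5^{1/3} \approx 0.206 .
\]
```

The repeated median takes one median per index instead, and each of those tolerates just under half of its inputs being contaminated, which is consistent with the 50% value quoted for it.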


3. COMPARING PROTEIN MOLECULAR STRUCTURES


The determination and comparison of the three-dimensional
configuration of protein molecules provides a setting for
illustration of the repeated median fitting method and how it
relates to least squares. Dower (1979) studied the structure of
"the Fv fragment of protein 315, a Dnp-binding BALB/c mouse
IgA(λ2) myeloma protein." Dower started with a predicted
structure based on previous studies of related proteins. This
initial configuration was modified and refined until nuclear
magnetic resonance properties computed for the modified structure
matched laboratory data from the protein fragment itself. The
comparison of the initial predicted configuration to the final
refined structure is of interest, and Dower used least squares
techniques as a basis for interpreting the differences.

We fitted all 50 points of the two homologous protein
molecules, each point being the center of the alpha carbon atom of
an amino acid in the protein chain. Rotation and translation were
allowed, but no magnification was fitted due to the nature of this
problem. After fitting, residual vectors were computed,
representing the direction and amount of shape change or distortion
which would be necessary at each point to deform one shape into the
other.
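
The residual vectors and their lengths follow directly from (1.1), or from (1.2) when a magnification is fitted. A minimal sketch, assuming NumPy and a hypothetical function name `residual_lengths`:

```python
import numpy as np

def residual_lengths(X, U, O, T, m=1.0):
    """Residual vectors U_i - m (T + O X_i) and their Euclidean lengths."""
    X = np.asarray(X, dtype=float)
    U = np.asarray(U, dtype=float)
    O = np.asarray(O, dtype=float)
    T = np.asarray(T, dtype=float)
    # Row i of X @ O.T is O applied to X_i; m = 1 gives the form (1.1).
    R = U - m * (T + X @ O.T)
    return R, np.linalg.norm(R, axis=1)
```

The lengths returned here are the quantities that would be summarized in histograms or plotted method against method when comparing two fits.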

Histograms of the lengths of these residuals are shown in
Figure 3.1, both for the least squares and for the robust fitting
methods. The expected relationship between least squares and robust
methods is evident: the robust method can tolerate a few larger
residuals in order to achieve a closer fit elsewhere, thereby
resulting in more small residuals than least squares could achieve.


Figure 3.2 shows a plot of the least squares residuals against
the repeated median residuals, allowing us to see how the residual
sizes have changed on an individual basis, with the 45 degree line
indicated for reference. This overall picture shows that the
fitting methods agree on the identification of the largest two
residuals, although they do not identify the same point as the third
largest.


The 50 amino acids are classified in Dower as belonging in six
distinct subgroups. Because the two largest residuals both belong
to the sixth group, this was examined separately. Table 3.1 lists
the coordinate and residual data for this group analysed by itself.
Figure 3.3 displays the residual lengths for this subgroup under
the two fitting methods. By reference to the 45 degree line, we
can see that all but one residual has been reduced by the robust fit.

FIGURE 3.1. Histograms of the residual distances between
homologous alpha carbon atoms for all fifty amino acids, based on
the least squares fit (top) and the repeated median fit (bottom).


FIGURE 3.2. Least-squares residuals plotted against repeated
median residuals for all fifty amino acids. Note that small
residuals are primarily above the 45 degree line, indicating that
repeated medians have achieved a closer fit in these areas.

TABLE 3.1. Cartesian coordinates of the alpha carbon atoms
of the six amino acids of group 6,
and the orthogonal matrix O estimated by repeated medians.


Initial configuration,           Modified by Dower to match
centered at the origin           nuclear magnetic resonance
                                 data, after least squares fit

U1 = ( 297, -225,  364)          X1 = ( 276, -122,  347)
U2 = ( 220, -195,   -4)          X2 = ( 158, -221,   -3)
U3 = ( 224,   32, -309)          X3 = ( 204,  -10, -316)
U4 = (-145,  107, -377)          X4 = (-163,   92, -337)
U5 = (-294,  101,  -23)          X5 = (-149,  229,   17)
U6 = (-304,  180,  351)          X6 = (-326,   32,  290)

The two fitting methods suggest different interpretations of the
relationships of the amino acids in group six, as indicated by the
histograms in Figure 3.4. Least squares suggests a continuous but
skewed distribution of residuals with no clear outliers, whereas the
robust fit might suggest the presence of at least one outlier.
Curiously, the amino acid corresponding to the largest robust
residual does not correspond to the largest least squares residual.

One interpretation of this configuration can be given if the two
shapes in group six differ primarily at one point. In this case the
robust fit will probably correctly identify this point by its large
residual. Because the sum of squares could not be minimized in the
presence of such a large residual, the least squares method would
probably select a rotation that distorts the relationship among the
other points while bringing the outlying points closer together.


FIGURE 3.4. Histograms of the residual distances between
homologous alpha carbon atoms for subgroup six fitted by itself,
based on the least squares fit (top) and the repeated median fit
(bottom).